Distributions in R

The density function

The density function starts with d and shows the “likelihood” of different values x given the distribution parameters. For instance dnorm(1.85, 1.7, 0.1) gives the likelihood of the value 1.85 given a normal distribution with a mean of 1.7 and a standard deviation of 0.1.

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
curve(dnorm(x,1.7,.1),  # function y = f(x) for value on y axis
      1.3,              # minimum x
      2.1,              # maximum x
      ylab = "density")

Generating random samples

The random generation function starts with r and generates random values x given the distribution parameters.


For instance rnorm(500, 1.7, 0.1) will produce 500 random values with (approximately) a mean of 1.7 and a standard deviation of 0.1. We can call these 500 values samples from the distribution.

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
samples = rnorm(500, 1.7, 0.1)
plot(samples, ylab = "value", xlab = "sample #")

If we have samples and want to see how they are distributed, we can use a histogram or a density plot. It is better to use histograms than density plots, because the latter involve estimation of the smoothness of the density, which can lead to artefacts.

par(mar=c(3,3,3,1), mgp=c(1.5,.5,0), tck=-.01)
par(mfrow = c(1,2))
hist(samples, main = "Histogram")
plot(density(samples), main = "Density plot")

Normal distribution & central limit theorem

Let’s assume the monthly growth rate follows the following distribution:

What is the distribution of heights of 10000 children at different ages?
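A minimal sketch of this simulation; the uniform monthly growth increments (0 to 2 cm per month) are an assumption for illustration, not the distribution used above:

```r
# Each child's height is a starting height plus the sum of many
# i.i.d. monthly growth increments (assumed uniform here).
set.seed(1)
n_children = 10000
grow = function(months)
  50 + colSums(matrix(runif(months * n_children, 0, 2), nrow = months))
par(mfrow = c(1, 2))
hist(grow(12), main = "Age 1", xlab = "height [cm]")
hist(grow(60), main = "Age 5", xlab = "height [cm]")
```

Even though the monthly increments are uniform, the resulting heights look normally distributed.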

This example illustrates the central limit theorem, which states that the result of processes that manifest as the sum of many small, independently and identically distributed events is normally distributed.

One way to explain why this is the case is to see that there are more possible combinations of events that lead to average outcomes than possible combinations of events that lead to extreme outcomes.

For instance, assume that you are tossing a fair coin four times; each time heads shows you receive one credit point, and each time tails shows you lose one credit point. The next table shows that there are more possible sequences that lead to an end result of 0 credit points than sequences that lead to 4 or more credit points.

Permutation event 1 event 2 event 3 event 4 sum
1 -1 -1 -1 -1 -4
2 1 -1 -1 -1 -2
3 -1 1 -1 -1 -2
4 1 1 -1 -1 0
5 -1 -1 1 -1 -2
6 1 -1 1 -1 0
7 -1 1 1 -1 0
8 1 1 1 -1 2
9 -1 -1 -1 1 -2
10 1 -1 -1 1 0
11 -1 1 -1 1 0
12 1 1 -1 1 2
13 -1 -1 1 1 0
14 1 -1 1 1 2
15 -1 1 1 1 2
16 1 1 1 1 4
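The coin-tossing table can be generated and checked directly in R:

```r
# Enumerate all 2^4 sequences of +1 (heads) / -1 (tails) credit points
events = expand.grid(e1 = c(-1, 1), e2 = c(-1, 1),
                     e3 = c(-1, 1), e4 = c(-1, 1))
events$sum = rowSums(events)
table(events$sum)
# six sequences end at 0, but only one each at -4 and 4
```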

Now let’s do the same experiment again, except that we are not looking at 4 but at 16 tosses, which leads to \(2^{16}\) = 65536 possible sequences. Here is the distribution of credit points.

If the number of “trials” is large and the success probability is close to 0.5, the binomial distribution “looks” like a normal distribution.
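A quick sketch makes this comparison concrete for the 16-toss example; the factor of 2 accounts for the spacing between possible credit-point values:

```r
# Credit points after n tosses follow a (shifted) binomial distribution;
# overlay a normal with matching mean and sd
n = 16; p = 0.5
k = 0:n
credit = 2 * k - n                       # heads minus tails
par(mar = c(3, 3, 1, 1), mgp = c(1.5, .5, 0), tck = -.01)
plot(credit, dbinom(k, n, p), type = "h", lwd = 2,
     xlab = "credit points", ylab = "probability")
# density times 2 = spacing between adjacent possible credit values
curve(2 * dnorm(x, 0, 2 * sqrt(n * p * (1 - p))), add = TRUE, col = "blue")
```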

One popular device to display such a process is a Galton1 board:

Galton board

Multivariate distributions

It is possible to observe multiple variables as outcome of a process. For instance, we might record the length and the weight of a newborn baby. To show how these variables are related, we plot them together.

dt = data.frame(length = rnorm(5000,50,5))
expected_weight = 3.5 + scale(dt$length)*.5
dt$weight = rnorm(5000,expected_weight,.5)

par(mfrow = c(1,2))
par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
with(dt,plot(length,weight, pch = 16, cex = .25, main = "Scatter plot"))
with(dt,smoothScatter(length,weight, main = "Smooth scatter plot"))

Another type of process that also leads to jointly distributed variables is Bayesian data analysis. If an analysis estimates several parameters, we can look at the posterior distribution of these parameters individually (their marginal distributions), or we can look at the joint posterior distribution of the parameters.

The typical way to describe variables that are jointly normally distributed is the multivariate normal distribution. A multivariate normal distribution has parameters you already know, plus one additional parameter:

  • means of the two marginal distributions (length and weight above)
  • variances (sd^2) for the two marginal distributions
  • and also the covariance of the two variables.

Here are these parameters for our birth weight and birth length data:

colMeans(dt)
##    length    weight 
## 49.851461  3.496047
cov(dt)
##           length    weight
## length 24.996613 2.4702856
## weight  2.470286 0.4963798

The variances are on the main diagonal of the covariance matrix, and the covariances are on the off-diagonal.

As you probably have guessed, covariance and correlation are linked:

correlation.R = 
  cor(dt)[2]
correlation.manual = 
  cov(dt)[2]/
  prod(sqrt(diag(cov(dt))))

cbind(correlation.R,
      correlation.manual)
##      correlation.R correlation.manual
## [1,]      0.701293           0.701293

What is marginalization?

Sometimes, we want information about only one dimension of the data. This information is shown in the marginal distribution.

One way to see how the marginal distribution is calculated is to imagine that the data points (or samples from a posterior) are collapsed over one variable. Like this:

Marginalization

This is only a visualization to give you an intuition. We won’t cover here how one calculates marginal integrals. When we are dealing with samples from posterior distributions, we also do not calculate integrals. In fact, just showing the histogram for one parameter of the posterior already displays this variable’s marginal distribution.

Here is a more traditional way to show two marginal distributions.

In this plot, each histogram shows the marginal distribution of length and weight, respectively. E.g., to get the frequency of length = 50 cm, we count all individuals with that length, regardless of their weight.
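Those marginal histograms can be sketched like this (recreating the dt data frame from above so the chunk is self-contained):

```r
# Recreate the example data (as above)
set.seed(1)
dt = data.frame(length = rnorm(5000, 50, 5))
expected_weight = 3.5 + scale(dt$length) * .5
dt$weight = rnorm(5000, expected_weight, .5)

# Each histogram "collapses over" the other variable
par(mfrow = c(1, 2), mar = c(3, 3, 2, 1), mgp = c(1.5, .5, 0), tck = -.01)
hist(dt$length, main = "Marginal of length", xlab = "length")
hist(dt$weight, main = "Marginal of weight", xlab = "weight")
```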

Linear regression model

What is the association between length and weight at birth?

We simulate some data:

dt = data.frame(length = rnorm(250,50,5))
expected_weight = 3.5 + scale(dt$length)*.5
dt$weight = rnorm(250,expected_weight,.5)

When data covary, we look at e.g. a scatter plot, which shows the joint distribution, to see how the data are related.

Modeling the mean

Describing the model

To model such data, we typically think first about the likelihood function. The question here is which distribution best describes the data we observed. Given that we can think of birth weight as the result of a sum of many processes, a normal distribution, with parameters \(\mu\) and \(\sigma\) for mean and standard deviation, makes sense.

One easy way to spot parameters is that they are typically Greek letters, whereas data variables are Roman letters.


What Notation quap R-code
Likelihood \(w_i \sim Normal(\mu,\sigma)\) weight ~ dnorm(mu, sigma)


Parameters \(\mu\) and \(\sigma\) are variables that we cannot observe directly; we have to estimate them from the data and the prior. The table above already shows how they depend on the data, but we still need to add that they also depend on the prior:


What Notation quap R-code
Likelihood \(w_i \sim Normal(\mu,\sigma)\) weight ~ dnorm(mu, sigma)
Prior \(\mu \sim Normal(3.5,1.5)\) mu ~ dnorm(3.5, 1.5)
Prior \(\sigma \sim Uniform(0,1.5)\) sigma ~ dunif(0, 1.5)


And if we want to impress (or scare off) a colleague, we can write down the full model:

\[ \overset{Posterior}{P(\mu,\sigma|w)} = \frac{\prod_i \overset{Likelihood}{Normal(w_i|\mu,\sigma)} \cdot \overset{Prior}{Normal(\mu|3.5,1.5) \cdot Uniform(\sigma|0,1.5)}} {\overset{Evidence}{\int\int \prod_i Normal(w_i|\mu,\sigma) \cdot Normal(\mu|3.5,1.5) \cdot Uniform(\sigma|0,1.5) \cdot d\mu \cdot d\sigma}} \]

This looks scarier than it is. \(\prod_i f\) just means that we calculate for each (individual) \(i\) the quantity behind the product sign \(\prod\) and then multiply all these quantities.

The numerator just means to calculate the following for each combination of \(\mu\) and \(\sigma\):

  1. for each participant \(i\), calculate the product of
  • the likelihood (\(Normal(w_i|\mu,\sigma)\)) and
  • the priors (\(Normal(\mu|3.5,1.5)\) and \(Uniform(\sigma|0,1.5)\))
  2. calculate the product of all values generated in 1.

This is not how the calculation is performed in Bayesian analysis software (except when one uses grid approximation). Instead, methods like Laplace approximation or MCMC are used to approximate the posterior.

The denominator is what is called the evidence, and its main purpose is to ensure that the posterior sums to 1. If we are mainly interested in the relative plausibility of parameter values, we do not calculate the denominator, because the product of likelihood and prior is already proportional to the posterior distribution. Only if we want to do different types of analysis, like Bayes factors, is it necessary to calculate the denominator/evidence.

Here is what the symbols in the denominator mean:
  • \(\int\int ... d\mu d\sigma\) means that we are integrating over two parameters, here \(\mu\) and \(\sigma\)
  • roughly speaking, we are
    • dividing the two-dimensional parameter space into cells of dimensions \(d\mu\) and \(d\sigma\)
    • for each cell we calculate \(\prod_i Normal(w_i|\mu,\sigma) \cdot Normal(\mu|3.5,1.5) \cdot Uniform(\sigma|0,1.5)\) and multiply it by the size of the cell \(d\mu \cdot d\sigma\)
    • and finally we sum all these quantities

Prior predictive check

Before one estimates the model parameters, one should check if the model with its priors makes sensible predictions. The goal is not to prior-predict data that are very similar to the observed data, but to prior-predict data that pass plausibility checks and are in the same ballpark as the observed data. Plausibility checks are simple things, like that there are no negative birth weights. “In the same ballpark” means that the values should have approximately the same order of magnitude. For instance, newborns weigh a few kilograms, not tens or hundreds of kilograms. Prior predictive checks rely on domain knowledge, and can be very useful for understanding the effect of multivariate priors. We still give it a quick try for our simple example:

prior.predictive.weights = 
  #                     mean                    sd                  
  rnorm(10000, rnorm(10000, 3.5, 1.5), runif(10000, 0, 1.5))   
my_blue = adjustcolor("blue",alpha = .5) # for model predictions
my_gray = adjustcolor("black",alpha = .5) # for data
par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
hist(prior.predictive.weights,
     main = "", xlab = "prior predictive weights",
     col = my_blue)
text(-3.5,2000,expression(mu~"="~Normal(3.5,1.5)), pos = 4)
text(-3.5,1850,expression(sigma~"="~Uniform(0,1.5)), pos = 4)

This looks reasonable, even though we surely know that there are no newborns with a weight below 0 or above 8 kilograms. But the point of the prior is not to forbid impossible values. Such impossible values should just be very rare, and if they have to be allowed in order to ensure a weak influence of the prior on the results, this is OK.

Note that our choice of priors is not influenced by the data we have, but by domain knowledge we have before seeing the data.

How does it look if we use non-informative priors?
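A sketch of this comparison; the particular “non-informative” priors here (a very wide normal for \(\mu\), a wide uniform for \(\sigma\)) are an assumption for illustration:

```r
# Prior predictive with very vague (hypothetical) priors:
# mu ~ Normal(0, 100), sigma ~ Uniform(0, 50)
flat.predictive.weights =
  rnorm(10000, rnorm(10000, 0, 100), runif(10000, 0, 50))
par(mar = c(3, 3, 0, 1), mgp = c(1.5, .5, 0), tck = -.01)
hist(flat.predictive.weights,
     main = "", xlab = "prior predictive weights",
     col = adjustcolor("blue", alpha = .5))
# most of the prior mass falls on absurd weights
# (negative, or hundreds of kilograms)
```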

It seems obvious that the first set of priors makes more sense.

Estimating the model parameters

To analyse the model, we can just use the data.frame we created above:

head(dt)
##     length   weight
## 1 49.24845 3.225774
## 2 53.76387 3.723847
## 3 57.76392 4.634252
## 4 52.00406 3.634153
## 5 52.94691 3.268657
## 6 54.08019 4.025827

Based on the quap code in the table above, we can now write down the quap model. alist is a command that creates a list that holds our model.

alpha.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- alpha,
    alpha  ~ dnorm(3.5,1.5),
    sigma  ~ dunif(0,1.5)
  )

For reasons that will become clear soon, we are using a parameter \(\alpha\) as the determinant of the mean \(\mu\).

Now we can use quap to calculate the posterior distribution.

alpha.model.fit = quap(alpha.model.list, data=dt)

And we can have a first glimpse of the results with the precis function from the rethinking package:

precis(alpha.model.fit)
##            mean         sd      5.5%     94.5%
## alpha 3.5135088 0.04677821 3.4387482 3.5882694
## sigma 0.7399883 0.03309310 0.6870992 0.7928775

Posterior predictive check

Before we start interpreting our modeling results, it is always a good idea to see if the model is any good at describing the data.

To do this, we first extract samples from the posterior:

alpha.model.post = extract.samples(alpha.model.fit,n=1e4)
# calculate mu, because quap does not return it automatically
alpha.model.post$mu = alpha.model.post$alpha
head(alpha.model.post)
##      alpha     sigma       mu
## 1 3.568703 0.7681519 3.568703
## 2 3.589680 0.7390735 3.589680
## 3 3.518113 0.7759987 3.518113
## 4 3.470779 0.7024368 3.470779
## 5 3.649114 0.7544280 3.649114
## 6 3.514508 0.7248699 3.514508

One thing we would like to check is if the observed mean is within the credible interval of the posterior for \(\mu\):

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
hist(alpha.model.post$mu, main = "",
     xlab = expression(mu), col = my_blue )
abline(v = HPDI(alpha.model.post$mu), col = "blue")
abline(v = mean(dt$weight), lwd = 2)

We also expect that most of the data lies within the posterior predicted values. First we calculate the posterior predictions, which depend on \(\mu\) and \(\sigma\):

par(mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
posterior.predictions = 
  rnorm(nrow(alpha.model.post),
        alpha.model.post$mu,
        alpha.model.post$sigma)

hist(dt$weight, col = my_gray, main = "", xlab = "weight")
hpdi = HPDI(posterior.predictions)
abline(v = hpdi , col = "blue")
in.hpdi = mean(dt$weight > hpdi[1] & dt$weight < hpdi[2])

title(paste0(round(in.hpdi*100),"% of weights are in the 89% HPDI"))

Now let’s simulate 200 weights and see if they are associated with length:
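The simulation code is not shown above; here is a sketch, assuming alpha.model.post and dt from the previous steps:

```r
# Simulate one predicted weight per child from the intercept-only model
# and plot the simulated weights against the observed lengths
sim.weights = rnorm(200,
                    alpha.model.post$mu[1:200],
                    alpha.model.post$sigma[1:200])
par(mar = c(3, 3, 0, 1), mgp = c(1.5, .5, 0), tck = -.01)
plot(dt$length[1:200], sim.weights,
     xlab = "length", ylab = "simulated weight")
```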

Unlike in the observed data, the predicted data do not show an association between length and weight. This is not surprising, because we did not use length to predict weight.

Linear regression

Instead of the previous model, where each individual's weight depends only on the group mean, we want a model in which individual weights also depend on birth length.

The linear regression model (predictor not centered)

Next we “build” the new model, which also takes birth length into account.


What Previous model New model
Likelihood \(y_i \sim Normal(\mu,\sigma)\) \(y_i \sim Normal(\mu_i,\sigma)\)
Linear model \(\mu = \alpha\) \(\mu_i = \alpha + \beta X_i\)
Prior \(\alpha \sim \dots\) \(\alpha \sim \dots\)
Prior \(\beta \sim \dots\)
Prior \(\sigma \sim \dots\) \(\sigma \sim \dots\)


The key difference of the new analysis is that we describe the expected weight as a function of an additional predictor, here \(X\). \(\beta\) is the slope of the regression model, which indicates how important the variable \(X\) is to predict the outcome. This linear regression model can be visualized as follows.

Thinking about priors

Generally, birth length varies roughly between 30 and 70 cm, and weight varies roughly between 2 and 6 kg.2 Then we can see that the length range is 70 - 30 = 40 and the weight range is 6 - 2 = 4. Finally we can calculate 4 / 40 to see that for a one-unit (1 cm) increase of length, weight should increase by around 0.1 kg.
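The back-of-the-envelope slope can be computed directly:

```r
# rough slope guess: range of weight [kg] / range of length [cm]
(6 - 2) / (70 - 30)
# 0.1 kg per cm
```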

We can use this number as a starting point to formulate a prior for the parameter \(\beta\) and then plot predictions from using this prior.

It is much harder to make a guess about the intercept (the predicted weight when length = 0), because length = 0 is so far out of the data range.

We plot some prior predictions.

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
plot(dt$length,dt$weight, xlim = c(0,65), ylim = c(-2,9))
clr = adjustcolor("blue",alpha = .5)
for (k in 1:50) {
  abline(a = rnorm(1,-1,1),
         b = rnorm(1,.1,.1),
         col = clr)
  }

The prior predictions show that our priors allow for predictions that are consistent with the data, but also for nonsensical predictions. Nevertheless, we move ahead for now.

Here is the quap model.

mu.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- a + b * length,
    a  ~ dnorm(-1.5,3),
    b  ~ dnorm(0.1,.1),
    sigma  ~ dunif(0,1.5)
  )

And we fit the quap model.

mu.model.fit = quap(mu.model.list, data=dt)
precis(mu.model.fit)
##             mean          sd        5.5%      94.5%
## a     -1.8152015 0.321830291 -2.32954847 -1.3008545
## b      0.1055694 0.006344299  0.09542995  0.1157088
## sigma  0.5115603 0.022877403  0.47499774  0.5481228

One peculiar characteristic of this analysis is that the parameter values for \(\alpha\) and \(\beta\) are dependent; they co-vary. We discussed earlier that one can have dependency between any variables, which also includes parameter values, and that we can display this dependency by plotting both variables jointly in a (smooth) scatter plot.

Here is a scatter plot (in blue) of the parameters combinations for \(\alpha\) and \(\beta\) that have the highest posterior probability. We show these high posterior probability samples on the background of the parameter space we explored (in green).

Remember, Bayesian analysis tries to find parameter combinations or configurations that have a high posterior probability given the data, likelihood function, and priors. In the next figure the prior is in green and the posterior is in dark blue.

We can’t see a lot here. Let us zoom in and highlight the samples in the posterior distribution with extreme values for \(\alpha\) and \(\beta\).

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
post = extract.samples(mu.model.fit)
smoothScatter(post$a,post$b,nrpoints = 0,
              ylab = "b", xlab = "a")
small_a.idx = which(post$a <= sort(post$a)[25])
large_a.idx = which(post$a >= sort(post$a,decreasing = T)[50])

small_a.a = post$a[small_a.idx]
small_a.b = post$b[small_a.idx]
large_a.a = post$a[large_a.idx]
large_a.b = post$b[large_a.idx]

points(small_a.a, small_a.b,col = adjustcolor("red",alpha = .5), pch = 16)
points(large_a.a, large_a.b,col = adjustcolor("blue",alpha = .5), pch = 16)

How is it possible that we see parameter combinations that have a similar posterior probability, and yet have very different parameter values? To understand this, we can plot predictions, i.e. regression lines, from such parameter values.
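Such a prediction plot can be sketched as follows, assuming dt and the small_a/large_a vectors from the previous code block:

```r
par(mar = c(3, 3, 0, 1), mgp = c(1.5, .5, 0), tck = -.01)
plot(dt$length, dt$weight, col = "grey",
     xlab = "length", ylab = "weight")
# red: low alpha / large beta, blue: high alpha / small beta
for (k in seq_along(small_a.a))
  abline(a = small_a.a[k], b = small_a.b[k],
         col = adjustcolor("red", alpha = .3))
for (k in seq_along(large_a.a))
  abline(a = large_a.a[k], b = large_a.b[k],
         col = adjustcolor("blue", alpha = .3))
```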

A linear regression tries to minimize the vertical distance between the regression line and data points on the y axis. What this plot shows us is that different parameter combinations (low \(\alpha\) with large \(\beta\) vs. high \(\alpha\) with small \(\beta\)) can lead to similar distances (likelihood values).

The correlation in the posterior can make it hard to fit more complex models. In addition, the parameters of the model we just fit are hard to interpret: What do intercept \(\alpha\) and slope \(\beta\) mean intuitively?

To deal with these problems, some recommend to center predictor variables.

Why does centering help against covariation of intercept and slope parameters?

If we want to fit a line through the point cloud, it needs to go through the center of the cloud, because only then can we achieve that the vertical distance from the points to the regression line is on average small. If we want to change the slope and still stay in the middle of the point cloud, we hence have to change the predicted weight value for length = 0 (the intercept). If we center length, centered length = 0 will be at the horizontal center of the point cloud. In this case, we can change the slope without changing the intercept (the predicted weight value for centered length = 0).

Analysis with a centered predictor length

Centering means subtracting the mean of the predictor variable from all values. To standardize, one would additionally divide by the standard deviation, but then the interpretation of slope parameters is less intuitive for domain experts.

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
dt$length.s = 
  scale(dt$length, center = TRUE, scale = FALSE) # centering
plot(dt$length.s,
     dt$weight,
     ylab = "weight",
     xlab = "centered length")
abline(v = 0, lty = 2, col = "grey")

Now it is easier to see that for the centered length, the intercept is actually just the weight at the average length. Therefore we can specify a prior based on our domain knowledge about average birth weights.

The linear regression model (centered predictor)

Next we “build” the new model, which can be visualized as follows:

Here is the full model


What Notation quap R-code
Likelihood \(w_i \sim Normal(\mu,\sigma)\) weight ~ dnorm(mu, sigma)
Linear model \(\mu_i = \alpha + \beta l_i\) mu[i] <- a + b * length.s[i]
Prior \(\alpha \sim Normal(3.5,1.5)\) alpha ~ dnorm(3.5, 1.5)
Prior \(\beta \sim Normal(.15, .5)\) beta ~ dnorm(.15, .5)
Prior \(\sigma \sim Uniform(0,1.5)\) sigma ~ dunif(0, 1.5)


The only real difference to the previous model is in lines 2 and 4. mu is now also a function of length, and we added a prior for the effect of length.

Prior predictive check

We do again a prior predictive check to see if model and prior are broadly in line with the data.

We are using the abline function for this; see its documentation for details.

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-11,11))
clr = adjustcolor("blue",alpha = .5)
abline(v = 0, lty = 3, col = "grey")
for (k in 1:50)
  abline(a = rnorm(1,3.5,1.5), 
         b = rnorm(1,.15,.5),
         col = clr)

This looks pretty wild. It is implausible that we have a negative association between length and weight, and the mean weight covers an implausibly large range. We can do better than that without letting the priors determine the fitting results.

A good rule to keep in mind is that for a normal prior (\(Normal(\mu,\sigma)\)), 95% of the prior mass lies between \(\mu - 2\sigma\) and \(\mu + 2\sigma\).
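This rule can be checked quickly with qnorm; for example, for a \(Normal(3.5, 1)\) prior:

```r
# central 95% interval of a Normal(3.5, 1) prior
qnorm(c(0.025, 0.975), mean = 3.5, sd = 1)
# close to c(3.5 - 2, 3.5 + 2), i.e. roughly mu -/+ 2*sigma
```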

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
plot(0,
     ylab = "weight",
     xlab = "scaled length",
     ylim = c(1,7),
     xlim = c(-11,11))
abline(v = 0, lty = 3, col = "grey")
for (k in 1:50)
  abline(a = rnorm(1,3.5,1),
         b = rlnorm(1,log(.175),.5),
         col = clr)

This looks much more reasonable. Here is the model with new priors:

What Notation quap R-code
Likelihood \(w_i \sim Normal(\mu,\sigma)\) weight ~ dnorm(mu, sigma)
Trans. paras. \(\mu_i = \alpha + \beta l_i\) mu[i] <- a + b * length.s[i]
Prior \(\alpha \sim Normal(3.5, 1)\) alpha ~ dnorm(3.5, 1)
Prior \(\beta \sim logNormal(log(.175), .5)\) beta ~ dlnorm(log(.175), .5)
Prior \(\sigma \sim Uniform(0,1.5)\) sigma ~ dunif(0, 1.5)

And here is the quap model:

mu.model.list =
  alist(
    weight ~ dnorm(mu,sigma),
    mu <- a + b * length.s,
    a  ~ dnorm(3.5,1),
    b  ~ dlnorm(log(0.175),.5),
    sigma  ~ dunif(0,1.5)
  )

Estimating the model parameters

We fit the quap model.

mu.model.fit = quap(mu.model.list, data=dt)
precis(mu.model.fit)
##            mean          sd      5.5%     94.5%
## a     3.5135078 0.032337215 3.4618267 3.5651889
## b     0.1060514 0.006336534 0.0959244 0.1161784
## sigma 0.5115638 0.022877874 0.4750005 0.5481271

Posterior predictive check

As should become custom, we check if our model describes the data reasonably well. We first extract the posterior samples.

mu.model.post = extract.samples(mu.model.fit)

Next we calculate the “linear predictor” \(\mu\) for each participant. Because we have 10000 posterior samples and 250 participants, we end up with a 10000 × 250 matrix of posterior predictions.

n_samples = nrow(mu.model.post)

get_mu.model.pp = function(x, posterior, type = "pred"){
  x = matrix(x, ncol = 1)
  apply(
    x,  # instead of a for loop
    1,  # for each row in dt
    function(length.s){
      with(data.frame(posterior),
           {
             mu = a + b * length.s             # calculate mu
             if (type == "lin_pred") {
               return(mu)
             } else {
               pp = rnorm(n_samples,mu,sigma)  # generate post. preds
               return(pp)
             }
           }
      )
    }
  )
}

mu.model.pp = get_mu.model.pp(dt$length.s,mu.model.post)

dim(mu.model.pp)
## [1] 10000   250

And we check if the predicted weights are consistent with the observed means.

Posterior predictive checks do not need to be limited to simply the outcome variable. We can check if any aspect of the data we care about works out. Here we check if the observed mean and sd lie within the credible intervals of the mean and sd of the predicted data.

par(mfrow = c(1,2))
par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
pp.mu = apply(mu.model.pp, 1, mean)
pp.sd = apply(mu.model.pp, 1 , sd)

hist(pp.mu, main = "", col = my_blue)
abline(v = c(HPDI(pp.mu),mean(dt$weight)), 
       col = c("blue","blue","black"), lwd = 3)

hist(pp.sd, main = "", col = my_blue)
abline(v = c(HPDI(pp.sd),sd(dt$weight)), 
       col = c("blue","blue","black"), lwd = 3)

We can also check if we see the covariation of birth length and birth weight if we use predicted weights.
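A sketch of this check, assuming dt and the mu.model.pp matrix computed above:

```r
par(mar = c(3, 3, 0, 1), mgp = c(1.5, .5, 0), tck = -.01)
# one row of mu.model.pp is one simulated data set of 250 weights
plot(dt$length.s, mu.model.pp[1, ],
     xlab = "centered length", ylab = "predicted weight")
abline(lm(mu.model.pp[1, ] ~ dt$length.s), col = "blue")
```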

Experience helps for thinking about good posterior predictive checks. One rule or guiding principle is that important characteristics of the data should still be visible in the posterior predictions.

Describe the results

Now that we have convinced ourselves that the model describes the data well enough (yes, this is to a degree subjective), we can present our results.

As a visual presentation we can e.g. just show the slope:

par(mar=c(3,3,0,1), mgp=c(1.5,.5,0), tck=-.01)
plot(dt$length.s,
     dt$weight,
     col = "grey", 
     ylab = "weight",
     xlab = "length",
     xaxt = "n")
x.ticks = seq(35,60,by = 5)
axis(1, 
     at = x.ticks-mean(dt$length), 
     labels = x.ticks)
for(k in 1:25) 
  abline(a = mu.model.post$a[k], 
         b = mu.model.post$b[k],
         col = adjustcolor("black",alpha = .2))

I like spaghetti plots, but confidence bands are more typically used:

link: posterior expectation, i.e. the best guess for a given x.

sim: posterior prediction, i.e. the best guess plus a random error.
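Here is a sketch of how such bands can be drawn with link, sim, PI, and shade from the rethinking package, assuming mu.model.fit and dt from above:

```r
# Posterior expectation (link) and posterior predictions (sim)
# over a grid of centered lengths
length.seq = seq(-15, 15, length.out = 50)
mu.link = link(mu.model.fit, data = list(length.s = length.seq))
w.sim   = sim(mu.model.fit,  data = list(length.s = length.seq))

mu.mean = apply(mu.link, 2, mean)
mu.PI   = apply(mu.link, 2, PI)   # band for the expectation
w.PI    = apply(w.sim,   2, PI)   # wider band for predictions

plot(dt$length.s, dt$weight, col = "grey",
     xlab = "centered length", ylab = "weight")
lines(length.seq, mu.mean)
shade(mu.PI, length.seq)
shade(w.PI, length.seq)
```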

This is how one could describe the results:

We found that a one standard deviation (1 sd) increase in birth length was associated with a 106 gram (credible interval 95, 116) increase of birth weight.

Splines

Splines are an alternative to polynomial regression if one wants to model non-linear relationships.

In polynomial regression, the regression weight for each basis function changes the predicted values over the entire range of the predictor variable.

x = seq(-3,3,.1)
X.poly = poly(x, 2, raw = T)

b1 = c(1,1)
b2 = c(1,2)

y1 = X.poly %*% b1
y2 = X.poly %*% b2

par(mfrow = c(2,1), mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
matplot(X.poly, type = 'l', ylab = "basis function value", xlab = "")
title("Polynomials")
plot(x, y1, 'l', lwd = 2,
     ylim = range(c(y1,y2)),
     col = "blue",
     ylab = "y (weighted sum of basis functions)")
lines(x,y2, col = "red", lwd = 2, lty = 2)

As a result, it is not possible to fit local trends.

Splines partition the x-axis into overlapping regions, and the parameters for the basis functions modify the predicted values only in specific regions.

library(splines)
knot.locations = seq(-3,3, length.out = 5)
X.splines = bs(x, knots = knot.locations)

b1 = scale(1:ncol(X.splines))^2
b2 = b1
b2[(length(b2)-2):length(b2)] = 0

y1 = X.splines %*% b1
y2 = X.splines %*% b2

par(mfrow = c(2,1), mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
matplot(X.splines, type = 'l', ylab = "basis function value", xlab = "")
title("Splines")
plot(x, y1, 'l', lwd = 2,
     ylim = range(c(y1,y2)),
     col = "blue",
     ylab = "y (weighted sum of basis functions)")
lines(x,y2, col = "red", lty = 2, lwd = 2)

As a result, splines can model local variations.

A quick look at important parameters for splines.

Let’s assume that we have the following data set:

set.seed(11)
x = seq(-3,3,length.out = 250)
knot.locations = seq(-3,3, length.out = 10)
X.splines = bs(x, knots = knot.locations)
b = rnorm(ncol(X.splines))
pre.y = X.splines %*% b
y = pre.y + rnorm(length(x),0,.3)
plot(x,y)

df = data.frame(x = x, y = y)

If we want to fit such a data set, we have to decide into how many (overlapping) sections we split the space along the x axis; each section is modeled by a basis function. This is done with the knots argument of the bs function in the splines package:

par(mfrow = c(2,1), mar=c(3,3,2,1), mgp=c(1.2,.25,0), tck=-.01)
n_knots = 5
knot_list = quantile(df$x,probs=seq(0,1,length.out=n_knots))
B = bs(df$x, knots=knot_list, degree=3)
matplot(df$x,B,type = 'l', main = "5 knots")
n_knots = 12
knot_list = quantile(df$x,probs=seq(0,1,length.out=n_knots))
B = bs(df$x, knots=knot_list, degree=3)
matplot(df$x,B,type = 'l', main = "12 knots")

Another thing we can decide is how large we allow the weights \(\omega\) for the basis functions to become.

To understand this, we inspect this quap model:

quap.spline.model = 
  alist(
    y ~ dnorm(mu,sigma),
    mu <- B %*% omega ,
    omega ~ dnorm(0,1),
    sigma ~ dexp(1)
  )

And we plot prior predictions for a handful of randomly chosen parameters \(\omega\):

quap.spline.model = 
  alist(
    y ~ dnorm(mu,sigma),
    mu <- B %*% omega ,
    omega ~ dnorm(0,omega_sd),  # omega_sd is varied below: 0.1 or 1
    sigma ~ dexp(1)
  )

The final figure shows the influence of the number of knots (in rows) and the standard deviation of omega (in columns). The figure shows randomly generated predictions from the model (prior predictions), which highlights that prior predictions are useful for understanding model parameters and priors.

par(mfrow = c(2,2), mar=c(3,3,2,1), mgp=c(1.2,.25,0), tck=-.01)
for (n_knots in c(5,20)) {
  for(sd in c(0.1,1)) {
    plot(df$x,df$y, col = "grey", xlab = "x")
    knot_list = quantile(df$x,probs=seq(0,1,length.out=n_knots))
    B = bs(df$x, knots=knot_list, degree=3)
    pred = B %*% matrix(rnorm(ncol(B)*5, sd = sd), nrow = ncol(B))
    matlines(df$x, pred, type = 'l', col = "blue", lty = 1)
    title(bquote("N(knots) = " ~ .(n_knots) ~", sd("~omega~") = " ~ .(sd)))
  }
}

Which model configuration (number of knots and standard deviation for \(\omega\)) looks most reasonable to you?

If we use data to inform the prior, this should preferably not be the data you are planning to analyze, but related, similar data. Strictly speaking, the data should only enter the analysis through the likelihood, not the prior. However, the concept of prior predictive checks runs a bit counter to that view, and some argue that it is OK to look at the data to get a broad idea about what priors are reasonable.


  1. Who certainly was clever, but is nowadays also infamous for his views on eugenics and race.↩︎

  2. I made this up; in a real analysis one might consult available statistics.↩︎